Abstract

In this paper, we consider the problem of approximating the densest subgraph in the dynamic graph
stream model. In this model of computation, the input graph is defined by an arbitrary sequence of edge
insertions and deletions and the goal is to analyze properties of the resulting graph given memory that
is sub-linear in the size of the stream. We present a single-pass algorithm that returns a $(1+\epsilon)$
approximation of the maximum density with high probability; the algorithm uses
$O(\epsilon^{-2} n \,\mathrm{polylog}\, n)$ space, processes each stream update in $\mathrm{polylog}(n)$ time,
and uses $\mathrm{poly}(n)$ post-processing time where $n$ is the number of nodes. The space used by our
algorithm matches the lower bound of Bahmani et al. (PVLDB 2012) up to a poly-logarithmic factor for
constant $\epsilon$. The best existing results for this problem were established recently by Bhattacharya
et al. (STOC 2015). They presented a $(2+\epsilon)$ approximation algorithm using similar space and another
algorithm that both processed each update and maintained a $(4+\epsilon)$ approximation of the current
maximum density in $\mathrm{polylog}(n)$ time per-update.


1 Introduction 

In the dynamic graph stream model of computation, a sequence of edge insertions and deletions defines an
input graph and the goal is to solve a specific problem on the resulting graph given only one-way access to
the input sequence and limited working memory. Motivated by the need to design efficient algorithms for
processing massive graphs, over the last four years there has been a considerable amount of work designing
algorithms in this model [1-5, 8, 9, 11, 18, 20, 22, 23, 25, 26]. Specific results include testing edge
connectivity [2] and node connectivity [20], constructing spectral sparsifiers [22], approximating the
densest subgraph [8], maximum matching [5, 9, 11, 25], correlation clustering [1], and estimating the
number of triangles [26]. For a recent survey of the area, see [28].

In this paper, we consider the densest subgraph problem. Let $G_U$ be the induced subgraph of graph
$G = (V, E)$ on nodes $U$. Then the density of $G_U$ is defined as

\[ d(G_U) = |E(G_U)| / |U| , \]

where $E(G_U)$ is the set of edges in the induced subgraph. We define the maximum density as

\[ d^* = \max_{U \subseteq V} d(G_U) \]

and say that the corresponding subgraph is the densest subgraph. The densest subgraph can be found
in polynomial time [10, 16, 19, 24] and more efficient approximation algorithms have been designed [10].
Finding dense subgraphs is an important primitive when analyzing massive graphs; applications include
community detection in social networks and identifying link spam on the web, in addition to applications on
financial and biological data. See [27] for a survey of applications and existing algorithms for the problem.
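To make the definitions concrete, the following is a minimal sketch that evaluates $d(G_U)$ and finds $d^*$ by exhaustive search on a toy graph. The function names and the example graph are illustrative assumptions, and the exhaustive search takes exponential time, so it is not one of the polynomial-time algorithms cited above.

```python
from itertools import combinations

def density(edges, nodes):
    """d(G_U) = |E(G_U)| / |U| for the subgraph induced by `nodes`."""
    nodes = set(nodes)
    induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return len(induced) / len(nodes)

def max_density_bruteforce(edges, vertices):
    """Maximise d(G_U) over all non-empty U by exhaustive search."""
    best_value, best_set = 0.0, None
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            value = density(edges, subset)
            if value > best_value:
                best_value, best_set = value, set(subset)
    return best_value, best_set

# Hypothetical example: a 4-clique on {0, 1, 2, 3} with a pendant path 3-4-5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
d_star, densest = max_density_bruteforce(edges, range(6))
```

On this example the clique has 6 edges on 4 nodes, so it attains the maximum density, while the whole graph has density 8/6.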

1.1 Our Results and Previous Work 

We present a single-pass algorithm that returns a $(1+\epsilon)$ approximation with high probability¹. For a
graph on $n$ nodes, the algorithm uses the following resources:

• Space: $O(\epsilon^{-2} n \,\mathrm{polylog}\, n)$. The space used by our algorithm matches the lower bound
of Bahmani et al. [7] up to a poly-logarithmic factor for constant $\epsilon$.

• Per-update time: $\mathrm{polylog}(n)$. We note that this is the worst-case update time rather than amortized
over all the edge insertions and deletions.

• Post-processing time: $\mathrm{poly}(n)$. This will follow by using any exact algorithm for densest subgraph
[10, 16, 19] on the subgraph generated by our algorithm.

The most relevant previous results for the problem were established recently by Bhattacharya et al. [8].
They presented two algorithms that use similar space to our algorithm and process updates in
$\mathrm{polylog}(n)$ amortized time. The first algorithm returns a $(2+\epsilon)$ approximation of the maximum
density of the final graph while the second (the more technically challenging result) outputs a
$(4+\epsilon)$ approximation of the current maximum density after every update while still using only
$\mathrm{polylog}(n)$ time per-update. Our algorithm improves the approximation factor to $(1+\epsilon)$ while
keeping the same space and update time. It is possible to modify our algorithm to output a $(1+\epsilon)$
approximation to the current maximum density after each update but the simplest approach would require the
post-processing step to be run after every edge update and this would not be efficient.

Bhattacharya et al. were one of the first to combine the space restriction of graph streaming with the
fast update and query time requirements of fully-dynamic algorithms from the dynamic graph algorithms
community. Epasto, Lattanzi, and Sozio [14] present a fully-dynamic algorithm that returns a $(2+\epsilon)$
approximation of the current maximum density. Other relevant work includes papers by Bahmani, Kumar, and
Vassilvitskii [7] and Bahmani, Goel, and Munagala [6]. The focus of these papers is on designing algorithms
in the MapReduce model but the resulting algorithms can also be implemented in the data stream model if
we allow multiple passes over the data.

1.2 Our Approach and Paper Outline 

The approach we take in this paper is as follows. In Section 2 we show that if we sample every edge of a
graph independently with a specific probability then we generate a graph that is a) sparse and b) can be used
to estimate the maximum density of the original graph. This is not difficult to show but requires care since
there are an exponential number of subgraphs in the subsampled graph that we will need to consider.

In Section 3 we show how to perform this sampling in the dynamic graph stream model. This can be
done using the $\ell_0$ sampling primitive [12, 21] that enables edges to be sampled uniformly from the set of
edges that have been inserted but not deleted. However, a naive application of this primitive would
necessitate $\Omega(n)$ per-update processing. To reduce this to $O(\mathrm{polylog}\, n)$ we reformulate the
sampling procedure in such a way that it can be performed more efficiently. This reformulation is based on
creating multiple partitions of the set of edges using pairwise independent hash functions and then sampling
edges within each group in the partition. The use of multiple partitions is somewhat reminiscent of that used
in the Count-Min sketch [13].

¹ Throughout this paper, we say an event holds with high probability if the probability is at least
$1 - n^{-c}$ for some constant $c > 0$.

Remark. Independently of our work, Esfandiari, Hajiaghayi, and Woodruff [15] also proved a similar
result to that presented in this paper. Their result is also based on uniformly sampling edges but their 
approach for ensuring fast update time differs and may be of independent interest. 


2 Subsampling Approximately Preserves Maximum Density 


In this section, we consider properties of a random subgraph of the input graph $G$. Specifically, let $G'$ be
the graph formed by sampling each edge in $G$ independently with probability $p$ where

\[ p = c\,\epsilon^{-2} \log n \cdot \frac{n}{m} \]

for some sufficiently large constant $c > 0$ and $0 < \epsilon < 1/2$. We may assume that $m$ is sufficiently
large such that $p < 1$ because otherwise we can reconstruct the entire graph in the allotted space using
standard results from the sparse recovery literature [17].
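As a rough illustration of the sampling rate, the sketch below computes $p$ and the expected number of sampled edges $pm = c\,\epsilon^{-2} n \log n$. The default $c = 1$, the base-2 logarithm, the cap at 1, and the function names are assumptions made for the example; the analysis only requires $c$ to be a sufficiently large constant.

```python
import math
import random

def sampling_probability(n, m, eps, c=1.0):
    """p = c * eps^-2 * log(n) * n / m, capped at 1 (p = 1 keeps every edge)."""
    return min(1.0, c * eps ** -2 * math.log2(n) * n / m)

def subsample(edges, p, rng):
    """Form G' by keeping each edge of G independently with probability p."""
    return [e for e in edges if rng.random() < p]

# Hypothetical parameters: 10^4 nodes, 10^7 edges, eps = 1/2.
n, m, eps = 10_000, 10_000_000, 0.5
p = sampling_probability(n, m, eps)
expected_sample_size = p * m   # equals c * eps^-2 * n * log(n) whenever p < 1
```

With these parameters $p$ is roughly 0.05, so the sampled graph keeps about 5% of the edges in expectation, and the expected sample size does not depend on $m$.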

We will prove that, with high probability, the maximum density of $G$ can be estimated up to factor
$(1+\epsilon)$ given $G'$. While it is easy to analyze how the density of a specific subgraph changes after the
edge sampling, we will need to consider all $2^n$ possible induced subgraphs and prove properties of the
subsampling for all of them.

The next lemma shows that $d(G'_U)$ is roughly proportional to $d(G_U)$ if $d(G_U)$ is "large" whereas if
$d(G_U)$ is "small" then $d(G'_U)$ will also be relatively small.

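Lemma 1. Let $U$ be an arbitrary set of $k$ nodes. Then,

\[ \mathbb{P}\left[ d(G'_U) \ge p d^*/10 \right] \le n^{-10k} \quad \text{if } d(G_U) \le d^*/60 , \]

\[ \mathbb{P}\left[ |d(G'_U) - p\, d(G_U)| \ge \epsilon p\, d(G_U) \right] \le 2n^{-10k} \quad \text{if } d(G_U) \ge d^*/60 . \]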

Proof. We start by considering the density of the entire graph $d(G) = m/n$ and therefore conclude that the
maximum density, $d^*$, is at least $m/n$. Hence, $p \ge (c\,\epsilon^{-2} \log n)/d^*$.

Let $X$ be the number of edges in $G'_U$ and note that $\mathbb{E}[X] = pk\, d(G_U)$. First assume
$d(G_U) \le d^*/60$. Then, by an application of the Chernoff Bound (e.g., [29, Theorem 4.4]), we observe that

\[ \mathbb{P}\left[ d(G'_U) \ge p d^*/10 \right] = \mathbb{P}\left[ X \ge pk d^*/10 \right] \le 2^{-pk d^*/10} \le 2^{-ck(\log n)/10} \]

and this is at most $n^{-10k}$ for sufficiently large constant $c$.

Next assume $d(G_U) \ge d^*/60$. Hence, by an application of an alternative form of the Chernoff Bound
(e.g., [29, Theorem 4.4 and 4.5]), we observe that

\[
\begin{aligned}
\mathbb{P}\left[ |d(G'_U) - p\, d(G_U)| \ge \epsilon p\, d(G_U) \right]
  &= \mathbb{P}\left[ |X - pk\, d(G_U)| \ge \epsilon pk\, d(G_U) \right] \\
  &\le 2 \exp(-\epsilon^2 pk\, d(G_U)/3) \\
  &\le 2 \exp(-\epsilon^2 pk\, d^*/180) \\
  &\le 2 \exp(-ck(\log n)/180)
\end{aligned}
\]

and this is at most $2n^{-10k}$ for sufficiently large constant $c$. □

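Corollary 2. With high probability, for all $U \subseteq V$:

\[ d(G'_U) \ge (1-\epsilon) p d^* \;\Rightarrow\; d(G_U) \ge \frac{1-\epsilon}{1+\epsilon} \cdot d^* . \]

Proof. There are $\binom{n}{k} \le n^k$ subsets of $V$ that have size $k$. Hence, by appealing to Lemma 1 and
the union bound, with probability at least $1 - 2n^{-9k}$, the following two equations hold,

\[ d(G'_U) \ge p d^*/10 \;\Rightarrow\; d(G_U) \ge d^*/60 \]

\[ d(G_U) \ge d^*/60 \;\Rightarrow\; d(G_U) \ge \frac{d(G'_U)}{p(1+\epsilon)} \]

for all $U \subseteq V$ such that $|U| = k$. Since $(1-\epsilon) p d^* \ge p d^*/10$, together these two
equations imply

\[ d(G'_U) \ge (1-\epsilon) p d^* \;\Rightarrow\; d(G_U) \ge \frac{d(G'_U)}{p(1+\epsilon)} \ge \frac{1-\epsilon}{1+\epsilon} \cdot d^* \]

for all sets $U$ of size $k$. Taking the union bound over all values of $k$ establishes the corollary. □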

We next show that the densest subgraph in G' corresponds to a subgraph in G that is almost as dense as 
the densest subgraph in G. 

Theorem 3. Let $U' = \operatorname{argmax}_U d(G'_U)$. Then with high probability,

\[ \frac{1-\epsilon}{1+\epsilon} \cdot d^* \le d(G_{U'}) \le d^* . \]

Proof. Let $U^* = \operatorname{argmax}_U d(G_U)$. By appealing to Lemma 1, we know that
$d(G'_{U^*}) \ge (1-\epsilon) p d^*$ with high probability. Therefore

\[ d(G'_{U'}) \ge d(G'_{U^*}) \ge (1-\epsilon) p d^* , \]

and the result follows by appealing to Corollary 2. □
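The statement of Theorem 3 can be illustrated on a toy instance with the sketch below: it subsamples the edges, finds the densest set $U'$ of the sampled graph by exhaustive search, and then evaluates $d(G_{U'})$ in the original graph. The graph, the value $p = 0.5$, the seed, and all names are assumptions chosen for illustration; on such a small graph the high-probability lower bound need not hold, whereas the upper bound $d(G_{U'}) \le d^*$ always does.

```python
import random
from itertools import combinations

def densest_set(edges, vertices):
    """Return (max_U d(G_U), argmax U) by exhaustive search over non-empty U."""
    best_value, best_set = -1.0, None
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            chosen = set(subset)
            inside = sum(1 for u, v in edges if u in chosen and v in chosen)
            if inside / k > best_value:
                best_value, best_set = inside / k, chosen
    return best_value, best_set

rng = random.Random(0)
vertices = list(range(8))
# Hypothetical graph: a dense core K5 on {0..4} plus a sparse path 4-5-6-7.
edges = list(combinations(range(5), 2)) + [(4, 5), (5, 6), (6, 7)]

p = 0.5  # illustrative only; not derived from the formula for p
sampled = [e for e in edges if rng.random() < p]

d_star, _ = densest_set(edges, vertices)
sampled_value, u_prime = densest_set(sampled, vertices)
estimate = sampled_value / p                       # p^{-1} max_U d(G'_U)
induced = sum(1 for u, v in edges if u in u_prime and v in u_prime)
density_in_g = induced / len(u_prime)              # d(G_{U'}), never above d*
```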

3 Implementing in the Dynamic Data Stream Model 

In this section, we show how to sample each edge independently with the prescribed probability in the
dynamic data stream model. The resulting algorithm uses $O(\epsilon^{-2} n \,\mathrm{polylog}\, n)$ space. The
near-linear dependence on $n$ almost matches the $\Omega(n)$ lower bound proved by Bahmani et al. [7]. The
main theorem we prove is:

Theorem 4. There exists a randomized algorithm in the dynamic graph stream model that returns a
$(1+\epsilon)$-approximation for the density of the densest subgraph with high probability. The algorithm uses
$O(\epsilon^{-2} n \,\mathrm{polylog}\, n)$ space and $O(\mathrm{polylog}\, n)$ update time. The post-processing
time of the algorithm is polynomial in $n$.

To sample the edges with probability p in the dynamic data stream model there are two main challenges: 

1. Any edge we sample during the stream may subsequently be deleted. 

2. Since p depends on m, we do not know the value of p until the end of the stream. 
To address the first challenge, we appeal to an existing result on the $\ell_0$ sampling technique [21]: there
exists an algorithm using $\mathrm{polylog}(n)$ space and update time that returns an edge chosen uniformly at
random from the final set of edges in the graph. Consequently we may sample $r$ edges uniformly at random
using $O(r \,\mathrm{polylog}\, n)$ update time and space. To address the fact we do not know $p$ a priori, we
could set $r \gg pm = c\,\epsilon^{-2} n \log n$, and then, at the end of the stream when $p$ and $m$ are
known a) choose $X \sim \mathrm{Bin}(m, p)$ where $\mathrm{Bin}(\cdot, \cdot)$ denotes the binomial
distribution and b) randomly pick $X$ distinct random edges amongst the set of $r$ edges sampled (ignoring
duplicates). This approach will work with high probability if $r$ is sufficiently large since $X$ is tightly
concentrated around $\mathbb{E}[X] = pm$. However, a naive implementation of this algorithm would require
$\omega(n)$ update time. The main contribution of this section is to demonstrate how to ensure
$O(\mathrm{polylog}\, n)$ update time.

3.1 Reformulating the Sampling Procedure 

We first describe an alternative sampling process that, with high probability, returns a set of edges S where 
each edge in S has been sampled independently with probability p as required. The purpose of this alterna¬ 
tive formulation is that it will allow us to argue that it can be emulated in the dynamic graph stream model 
efficiently. 

Basic Approach. The basic idea is to partition the set of edges into different groups and then sample edges 
within groups that do not contain too many edges. We refer to such groups as “small”. We determine which 
of the edges in a small group are to be sampled in two steps: 

• Fix the number X of edges to sample: Let $X \sim \mathrm{Bin}(g, p)$ where $g$ is the number of edges in the
relevant group.

• Fix which X edges to sample: We then randomly pick X edges without replacement from the relevant 
group. 

It is not hard to show that this two-step process ensures that each edge in the group is sampled independently 
with probability p. At this point, the fate of all edges in small groups has been decided: they will either be 
returned in the final sample or definitely not returned in the final sample. 
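The two-step process can be sketched as follows. This is a minimal illustration in which the helper names are hypothetical and the binomial draw is simulated by summing Bernoulli trials rather than by any particular library routine.

```python
import random

def binomial(g, p, rng):
    """Draw X ~ Bin(g, p) as the number of successes in g Bernoulli(p) trials."""
    return sum(1 for _ in range(g) if rng.random() < p)

def two_step_sample(group, p, rng):
    """Step 1: fix X ~ Bin(|group|, p). Step 2: pick X edges without replacement."""
    x = binomial(len(group), p, rng)
    return rng.sample(group, x)

rng = random.Random(1)
group = [(i, i + 1) for i in range(50)]   # hypothetical group of 50 edges
picked = two_step_sample(group, 0.3, rng)
```

Every edge in `picked` is distinct and belongs to `group`; deciding the count first is what later allows the edges themselves to be drawn from a fixed-size pool collected during the stream.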

We next consider another partition of the edges and again consider groups that do not contain many
edges. We then determine the fate of the edges in such groups whose fate has not hitherto been determined. 
We keep on considering different partitions until every edge has been included in a small group and has had 
its fate determined. 

Lemma 5. Assume for every edge there exists a partition such that the edge is in a small group. Then 
the distribution over sets of sampled edges is the same as the distribution had each edge been sampled 
independently with probability p. 

Proof. The proof does not depend on the exact definition of "small" and the only property of the partitions
that we require is that every edge is in a small group of some partition. We henceforth consider a fixed set
of partitions with this property.

We first consider the $j$th group in the $i$th partition. Let $g$ be the number of edges in this group. For any
subset $Q$ of $\ell$ edges in this group, we show that the probability that $Q$ is picked by the two-step
process above is indeed $p^\ell$:

\[
\begin{aligned}
\mathbb{P}\left[ \forall e \in Q,\ e \text{ is picked} \right]
  &= \sum_{t=\ell}^{g} \mathbb{P}\left[ \forall e \in Q,\ e \text{ is picked} \mid X = t \right] \mathbb{P}\left[ X = t \right] \\
  &= \sum_{t=\ell}^{g} \frac{\binom{g-\ell}{t-\ell}}{\binom{g}{t}} \binom{g}{t} p^t (1-p)^{g-t} \\
  &= p^\ell \sum_{t=\ell}^{g} \binom{g-\ell}{t-\ell} p^{t-\ell} (1-p)^{g-t} = p^\ell
\end{aligned}
\]

and hence edges within the same group are sampled independently with probability $p$. Furthermore, the
edges in different groups of the same partition are sampled independently from each other.

Let $f(e)$ be the first partition in which $e$ is placed in a group that is small and let
$W_i = \{e : f(e) = i\}$. Restricting $Q$ to edges in $W_i$ in the above analysis establishes that edges in
each $W_i$ are sampled independently. Since $f(e)$ is determined by the fixed set of partitions rather than the
randomness of the sampling procedure, we also conclude that edges in different $W_i$ are sampled
independently. As we assume that every edge belongs to at least one small group in some partition, if we let
$r$ be the total number of partitions, then $W_1, W_2, \ldots, W_r$ partition the set of edges $E$. Hence, all
edges in $E$ are sampled independently with probability $p$. □
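The identity at the heart of this proof, namely that a fixed set of $\ell$ edges in a group of size $g$ is entirely picked with probability $p^\ell$, can be checked with exact rational arithmetic. The sketch below is only a sanity check for small parameters; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_subset_picked(g, ell, p):
    """P[every edge of a fixed ell-subset Q is picked] under the two-step process."""
    total = Fraction(0)
    for t in range(ell, g + 1):
        # P[Q is contained in the sample | X = t] = C(g - ell, t - ell) / C(g, t)
        conditional = Fraction(comb(g - ell, t - ell), comb(g, t))
        # P[X = t] for X ~ Bin(g, p)
        binomial_pmf = comb(g, t) * p ** t * (1 - p) ** (g - t)
        total += conditional * binomial_pmf
    return total

p = Fraction(1, 3)
value = prob_subset_picked(7, 3, p)   # a group of 7 edges and a subset of 3 edges
```

Using `Fraction` avoids floating-point error, so the comparison with $p^\ell$ is exact.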


Details of Alternative Sampling Procedure. The partitions considered will be determined by pairwise
independent hash functions and we will later argue that it is sufficient to consider only $O(\log n)$
partitions. Each hash function will partition the $m$ edges into $n\epsilon^{-2}$ groups. In expectation the
number of edges in a group will be $\epsilon^2 m/n$ and we define a group to be small if it contains at most
$t = 4\epsilon^2 m/n$ edges. We therefore expect to sample less than $4p\epsilon^2 m/n = 4c \log n$ edges from
a small group. We will abort the algorithm if we attempt to sample significantly more edges than this from
some small group. The procedure is as follows:


• Let $h_1, \ldots, h_r : \binom{[n]}{2} \to [n\epsilon^{-2}]$ be pairwise independent hash functions where
$r = 10 \log n$.

• Each $h_i$ defines a partition of $E$ comprising of sets of the form

\[ E_{i,j} = \{ e \in E : h_i(e) = j \} . \]

Say $E_{i,j}$ is small if it is of size at most $t = 4\epsilon^2 m/n$. Let $D_i$ be the set of all edges in the
small sets determined by $h_i$.

• For each small $E_{i,j}$, let

\[ X_{i,j} = \mathrm{Bin}(|E_{i,j}|, p) \]

and abort if

\[ X_{i,j} \ge \tau \quad \text{where } \tau = 24c \log n . \]

Let $S_{i,j}$ be a set of $X_{i,j}$ edges sampled without replacement from $E_{i,j}$.

• Let $S$ be the set of edges that were sampled among some $D_i$ that are not in
$D_1 \cup D_2 \cup \ldots \cup D_{i-1}$, i.e., edges whose fate had not already been determined.

\[ S = \bigcup_{i=1}^{r} \left\{ e \in D_i : e \in \cup_j S_{i,j} \text{ and } e \notin D_1 \cup D_2 \cup \ldots \cup D_{i-1} \right\} \]
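An offline simulation of this procedure might look as follows. It is a sketch under several assumptions: the hash family is the standard $((ax + b) \bmod P) \bmod B$ construction, which is only approximately pairwise independent after the final reduction; edges are encoded as the integer $un + v$; the logarithm is base 2; and $\tau$ is passed in by the caller. All names are hypothetical.

```python
import math
import random
from collections import defaultdict

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any edge id used here

def make_hash(num_buckets, rng):
    """Return h(x) = ((a*x + b) mod PRIME) mod num_buckets for random a, b."""
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % num_buckets

def partitioned_sample(edges, n, eps, p, tau, rng):
    """Return (S, undecided): sampled edges and edges never seen in a small group."""
    m = len(edges)
    r = max(1, math.ceil(10 * math.log2(n)))       # number of partitions
    buckets = max(1, math.ceil(n / eps ** 2))      # groups per partition
    t = 4 * eps ** 2 * m / n                       # size threshold for "small"
    decided, sample = set(), set()
    for _ in range(r):
        h = make_hash(buckets, rng)
        groups = defaultdict(list)
        for (u, v) in edges:
            groups[h(u * n + v)].append((u, v))
        for group in groups.values():
            if len(group) > t:
                continue                           # group is not small: skip it
            x = sum(1 for _ in group if rng.random() < p)   # X_{i,j} ~ Bin(|E_{i,j}|, p)
            if x >= tau:
                raise RuntimeError("abort: X_{i,j} reached tau")
            picked = set(rng.sample(group, x))     # S_{i,j}
            for e in group:
                if e not in decided:               # fate not fixed by an earlier partition
                    decided.add(e)
                    if e in picked:
                        sample.add(e)
    return sample, set(edges) - decided

# Hypothetical input: the complete graph on 50 nodes (1225 edges), eps = 1/2.
n, eps = 50, 0.5
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]
all_kept, undecided = partitioned_sample(edges, n, eps, 1.0, 10 ** 9, random.Random(0))
```

Setting $p = 1$ here is only a way to inspect which edges had their fate decided: every decided edge is then kept.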
Analysis. There are two main things that we need to show to establish that the above process emulates
our basic sampling approach with high probability. First, we will show that with high probability for every
edge $e$ there exists $i$ and $j$ such that $e \in E_{i,j}$ and $E_{i,j}$ is small. This ensures that we will
make a decision on whether $e$ is included in the final sample. Second, we will show that it is very unlikely
we abort because some $X_{i,j}$ is too large.

Lemma 6. With probability at least $1 - n^{-8}$, for every edge $e$ there exists $i$ such that
$e \in E_{i,j}$ and $E_{i,j}$ is small.

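Proof. Fix $i \in [r]$ and let $j = h_i(e)$. Then
$\mathbb{E}[|E_{i,j}|] \le 1 + \epsilon^2 (m-1)/n \le 2\epsilon^2 m/n$ assuming $m \ge \epsilon^{-2} n$.

By an application of the Markov bound:

\[ \mathbb{P}\left[ |E_{i,j}| \ge 4m\epsilon^2/n \right] \le 1/2 . \]

Since each $h_i$ is independent,

\[ \mathbb{P}\left[ |E_{i,h_i(e)}| \ge 4m\epsilon^2/n \text{ for all } i \right] \le 1/2^r = 1/n^{10} . \]

Therefore by the union bound over all $m \le n^2$ edges there exists a good partition for each $e$ with
probability at least $1 - n^{-8}$. □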

Lemma 7. With high probability, all $X_{i,j}$ are less than $\tau = 24c \log n$.

Proof. Since $E_{i,j}$ is small then $\mathbb{E}[X_{i,j}] = |E_{i,j}|\, p \le 4\epsilon^2 pm/n = 4c \log n$.
Hence, by an application of the Chernoff bound,

\[ \mathbb{P}\left[ X_{i,j} \ge 24c \log n \right] \le 2^{-24c \log n} < n^{-10} . \]

Taking the union bound over all $10 \log n$ values of $i$ and $\epsilon^{-2} n$ values of $j$ establishes the
lemma. □

3.2 The Dynamic Graph Stream Algorithm 

We are now ready to present the dynamic graph stream algorithm. To emulate the above sampling process 
in the dynamic graph stream model, we proceed as follows: 

1. Pre-Processing: Pick the hash functions $h_1, h_2, \ldots, h_r$. These define the sets $E_{i,j}$.

2. During One Pass:

• Compute the size of each $E_{i,j}$ and $m$. Note that $m$ is necessary to define $p$.

• Sample $\tau$ edges $S'_{i,j}$ uniformly without replacement from each $E_{i,j}$.

3. Post-Processing:

• Randomly determine the values $X_{i,j}$ based on the exact values of $|E_{i,j}|$ and $m$ for each $E_{i,j}$
that is small. If $X_{i,j}$ exceeds $\tau$ then abort.

• Let $S_{i,j}$ be a random subset of $S'_{i,j}$ of size $X_{i,j}$.

• Return $p^{-1} \max_U d(G'_U)$ where $G'$ is the graph with edges:

\[ S = \bigcup_{i=1}^{r} \left\{ e \in D_i : e \in \cup_j S_{i,j} \text{ and } e \notin D_1 \cup D_2 \cup \ldots \cup D_{i-1} \right\} \]
Note that it is possible to compute $|E_{i,j}|$ using a counter that is incremented or decremented whenever
an edge $e$ is added or removed respectively that satisfies $h_i(e) = j$. We may evaluate pairwise independent
hash functions in $O(\mathrm{polylog}\, n)$ time. The exact value of $\max_U d(G'_U)$ can be determined in
polynomial time using the result of Charikar [10]. To prove Theorem 4 it remains to describe how to sample
$\tau$ edges without replacement from each $E_{i,j}$.
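The counters for $m$ and $|E_{i,j}|$ could be maintained along the following lines. This is a sketch with hypothetical names; the hash family $((ax + b) \bmod P) \bmod B$ and the integer encoding of edges are assumptions, and only the counting logic is shown, not the edge sampling.

```python
import random
from collections import Counter

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any edge id used here

class GroupSizeCounters:
    """Maintain m and every |E_{i,j}| under edge insertions and deletions."""

    def __init__(self, n, num_hashes, num_buckets, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.num_buckets = num_buckets
        self.coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(num_hashes)]
        self.sizes = [Counter() for _ in range(num_hashes)]
        self.m = 0

    def bucket(self, i, edge):
        """Index j of the group that contains `edge` under the i-th hash function."""
        u, v = min(edge), max(edge)
        a, b = self.coeffs[i]
        return ((a * (u * self.n + v) + b) % PRIME) % self.num_buckets

    def update(self, edge, delta):
        """Apply one stream update: delta = +1 for an insertion, -1 for a deletion."""
        self.m += delta
        for i in range(len(self.coeffs)):
            self.sizes[i][self.bucket(i, edge)] += delta

counters = GroupSizeCounters(n=10, num_hashes=3, num_buckets=5, seed=1)
stream = [((0, 1), +1), ((1, 2), +1), ((2, 3), +1), ((1, 2), -1)]
for edge, delta in stream:
    counters.update(edge, delta)
```

Each update touches one counter per hash function, so the work per update is proportional to the number of hash functions.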

Sampling Edges Without Replacement Via $\ell_0$-Sampling. To do this, we use the $\ell_0$-sampling algorithm
of Jowhari et al. [21]. Their algorithm returns, with high probability, a random edge from $E_{i,j}$ and the
space and update time of the algorithm are both $O(\mathrm{polylog}\, n)$. Running $\tau$ independent
instantiations of this algorithm immediately enables us to sample $\tau$ edges uniformly from $E_{i,j}$ with
replacement.

However, since their algorithm is based on linear sketches, there is an elegant way (at least, more elegant
than simply over sampling and removing duplicates) to ensure that all samples are distinct. Specifically, let
$\mathbf{x}$ be the characteristic vector of the set $E_{i,j}$. Then, $\tau$ instantiations of the algorithm of
Jowhari et al. [21] generate random projections

\[ \mathcal{A}_1(\mathbf{x}),\ \mathcal{A}_2(\mathbf{x}),\ \ldots,\ \mathcal{A}_\tau(\mathbf{x}) \]

of $\mathbf{x}$ such that a random non-zero entry of $\mathbf{x}$ (which corresponds to an edge from
$E_{i,j}$) can be identified by processing each $\mathcal{A}_i(\mathbf{x})$. Let $e_1$ be the edge
reconstructed from $\mathcal{A}_1(\mathbf{x})$. Rather than reconstructing an edge from
$\mathcal{A}_2(\mathbf{x})$, which could be the same as $e_1$, we instead reconstruct an edge $e_2$ from

\[ \mathcal{A}_2(\mathbf{x}) - \mathcal{A}_2(\mathbf{i}_{e_1}) = \mathcal{A}_2(\mathbf{x} - \mathbf{i}_{e_1}) \]

where $\mathbf{i}_{e_1}$ is the characteristic vector of the set $\{e_1\}$. Note that $e_2$ is necessarily
different from $e_1$ since $\mathbf{x} - \mathbf{i}_{e_1}$ is the characteristic vector of the set
$E_{i,j} \setminus \{e_1\}$. Similarly we reconstruct $e_j$ from

\[ \mathcal{A}_j(\mathbf{x}) - \mathcal{A}_j(\mathbf{i}_{e_1}) - \mathcal{A}_j(\mathbf{i}_{e_2}) - \ldots - \mathcal{A}_j(\mathbf{i}_{e_{j-1}}) = \mathcal{A}_j(\mathbf{x} - \mathbf{i}_{e_1} - \ldots - \mathbf{i}_{e_{j-1}}) \]

and note that $e_j$ is necessarily distinct from $\{e_1, e_2, \ldots, e_{j-1}\}$.
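The linearity argument can be illustrated with a much simpler sketch than an $\ell_0$-sampler. The toy class below is an assumed stand-in, not the algorithm of Jowhari et al.: it stores three linear measurements of an integer vector and can recover the support only when the vector is non-negative and has exactly one non-zero entry. It shows how subtracting the sketch of $\mathbf{i}_{e_1}$ removes $e_1$ from consideration.

```python
class OneSparseSketch:
    """Linear sketch A(x) = (sum x_i, sum i*x_i, sum i^2*x_i) of an integer vector x."""

    def __init__(self, s0=0, s1=0, s2=0):
        self.s0, self.s1, self.s2 = s0, s1, s2

    def update(self, index, delta):
        """Process a stream update x[index] += delta."""
        self.s0 += delta
        self.s1 += index * delta
        self.s2 += index * index * delta

    def __sub__(self, other):
        """A(x) - A(y) = A(x - y), because every measurement is linear in x."""
        return OneSparseSketch(self.s0 - other.s0, self.s1 - other.s1, self.s2 - other.s2)

    def recover(self):
        """Return the unique non-zero index of a non-negative 1-sparse x, else None."""
        if self.s0 > 0 and self.s1 % self.s0 == 0 and self.s0 * self.s2 == self.s1 ** 2:
            return self.s1 // self.s0
        return None

def sketch_of(indices):
    """Sketch of the characteristic vector of `indices`."""
    sketch = OneSparseSketch()
    for i in indices:
        sketch.update(i, +1)
    return sketch

# Stream over edge ids: insert 3, 7, 12 and then delete 12, so the group is {3, 7}.
a2 = OneSparseSketch()
for index, delta in [(3, +1), (7, +1), (12, +1), (12, -1)]:
    a2.update(index, delta)

e1 = 3                                     # assume a different sketch A_1 returned edge 3
e2 = (a2 - sketch_of([e1])).recover()      # reconstruct from A_2(x - i_{e1})
```

Here `a2.recover()` on its own fails because two edges remain, while the subtracted sketch describes the set with $e_1$ removed and so can only return a different edge.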

4 Conclusion 

We presented the first algorithm for estimating the density of the densest subgraph up to a $(1+\epsilon)$
factor in the dynamic graph stream model. Our algorithm used $O(\epsilon^{-2} n \,\mathrm{polylog}\, n)$ space,
$\mathrm{polylog}(n)$ per-update processing time, and $\mathrm{poly}(n)$ post-processing to return the estimate.
The most relevant previous results, by Bhattacharya et al. [8], were a $(2+\epsilon)$ approximation in similar
space and a $(4+\epsilon)$ approximation with $\mathrm{polylog}(n)$ per-update processing time that also outputs
an estimate of the maximum density after each edge insertion or deletion. A natural open question is whether
it is possible to use ideas contained in this paper to improve the approximation factor for the problem of
maintaining a running estimate of the maximum density.